# Efficient Inference Optimization

## AM Thinking V1 GGUF
**Author:** Mungert · **License:** Apache-2.0 · **Downloads:** 1,234 · **Likes:** 1 · **Tags:** Large Language Model, Transformers

AM-Thinking-v1 is a text-generation model distributed in GGUF format, suitable for a variety of natural language processing tasks.
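For reference, a minimal local-inference sketch with llama-cpp-python; the repo id and quantization filename below are assumptions, so match them to the files actually published in the repository:

```python
# Minimal local inference with a GGUF model via llama-cpp-python.
# Assumes `pip install llama-cpp-python huggingface_hub`; the filename
# glob below is hypothetical -- pick a quantization the repo actually ships.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Mungert/AM-Thinking-v1-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                # glob for a mid-size quant
    n_ctx=4096,                             # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```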
## Qwen3 235B A22B Exl3
**Author:** MikeRoz · **License:** Apache-2.0 · **Downloads:** 37 · **Likes:** 4 · **Tags:** Large Language Model

An Exllamav3-quantized version of Qwen3-235B-A22B, offering multiple quantization options to balance model size against performance.
## Llama 3.1 Nemotron Nano 4B V1.1
**Author:** nvidia · **License:** Other · **Downloads:** 5,714 · **Likes:** 61 · **Tags:** Large Language Model, Transformers, English

Llama-3.1-Nemotron-Nano-4B-v1.1 is a compressed and optimized large language model derived from Llama 3.1, focused on reasoning and dialogue tasks, supporting a 128K context length, and able to run on a single RTX GPU.
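A minimal loading sketch with the transformers library; the Hugging Face model id is inferred from the entry, and the dtype/device settings are assumptions sized for a single consumer GPU:

```python
# Loading a long-context causal LM on a single GPU with transformers.
# The model id is inferred from the entry; dtype/device settings are
# assumptions sized for a single consumer RTX card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"  # assumed HF id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs fp32
    device_map="auto",           # place layers on the available GPU
)

inputs = tok("Explain KV-cache reuse briefly.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```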
## Falcon H1 34B Instruct
**Author:** tiiuae · **License:** Other · **Downloads:** 2,454 · **Likes:** 28 · **Tags:** Large Language Model, Transformers

Falcon-H1 is an efficient hybrid-architecture language model developed by TII, combining the strengths of Transformer attention and Mamba state-space layers, and supporting English as well as multilingual tasks.
## Falcon H1 34B Base
**Author:** tiiuae · **License:** Other · **Downloads:** 175 · **Likes:** 7 · **Tags:** Large Language Model, Transformers, Multilingual

Falcon-H1 is a hybrid-architecture language model developed by the UAE's Technology Innovation Institute, combining the strengths of the Transformer and Mamba architectures and supporting multilingual processing.
## Open Thoughts OpenThinker2 7B GGUF
**Author:** bartowski · **License:** Apache-2.0 · **Downloads:** 1,023 · **Likes:** 5 · **Tags:** Large Language Model

A quantized version of OpenThinker2-7B produced with llama.cpp, suitable for text-generation tasks.
## Nemotron H 8B Base 8K
**Author:** nvidia · **License:** Other · **Downloads:** 5,437 · **Likes:** 38 · **Tags:** Large Language Model, Transformers, Multilingual

NVIDIA Nemotron-H-8B-Base-8K is a large language model that generates completions for a given text prefix. It uses a hybrid architecture composed primarily of Mamba-2 and MLP layers with only four attention layers, supports an 8K context length, and covers multiple languages, including English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese.
## Llama 3.1 Nemotron Nano 8B V1
**Author:** nvidia · **License:** Other · **Downloads:** 60.52k · **Likes:** 145 · **Tags:** Large Language Model, Transformers, English

A reasoning and dialogue model optimized from Meta's Llama-3.1-8B-Instruct, supporting a 128K context length and balancing efficiency with performance.
## Gemma 3 12b It Q5 K S GGUF
**Author:** NikolayKozloff · **Downloads:** 16 · **Likes:** 1 · **Tags:** Large Language Model

A GGUF-quantized version of Google's Gemma 3 12B instruction-tuned model, suitable for local inference and text-generation tasks.
## Gemma 3 27b It Q4 K M GGUF
**Author:** paultimothymooney · **Downloads:** 299 · **Likes:** 2 · **Tags:** Large Language Model

A GGUF-format conversion of Google's Gemma 3 27B IT model, suitable for local inference.
## Bge Reranker V2 M3 Q8 0 GGUF
**Author:** pqnet · **License:** Apache-2.0 · **Downloads:** 54 · **Likes:** 0 · **Tags:** Text Embedding, Other

A GGUF-format text-ranking model converted from BAAI/bge-reranker-v2-m3, supporting multilingual reranking inference.
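How cross-encoder reranking is typically run, sketched against the upstream BAAI/bge-reranker-v2-m3 checkpoint with transformers (running the GGUF conversion itself would instead go through llama.cpp):

```python
# Cross-encoder reranking: score (query, passage) pairs jointly.
# Sketch uses the upstream BAAI checkpoint via transformers; higher
# logit means the passage is more relevant to the query.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "BAAI/bge-reranker-v2-m3"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

pairs = [
    ["what is a reranker?", "A reranker scores query-document pairs jointly."],
    ["what is a reranker?", "GGUF is a binary file format for llama.cpp."],
]
with torch.no_grad():
    inputs = tok(pairs, padding=True, truncation=True, return_tensors="pt")
    scores = model(**inputs).logits.squeeze(-1)  # one relevance score per pair
print(scores.tolist())
```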
## Formatclassifier
**Author:** WebOrganizer · **Downloads:** 2,429 · **Likes:** 5 · **Tags:** Text Classification, Transformers, Other

The FormatClassifier model categorizes web content into 24 format classes based on the page URL and text content.
## Topicclassifier
**Author:** WebOrganizer · **Downloads:** 2,288 · **Likes:** 9 · **Tags:** Text Classification, Transformers, Other

A topic-classification model fine-tuned from gte-base-en-v1.5, capable of classifying web content into 24 topic categories.
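Both WebOrganizer classifiers score the URL and page text together. A hedged sketch; the exact input template and the trust_remote_code requirement are assumptions recalled from the model cards, so verify against them before relying on this:

```python
# Classifying a web page into one of 24 topic classes.
# The "{url}\n\n{text}" input template and trust_remote_code flag are
# assumptions taken from memory of the WebOrganizer model cards.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "WebOrganizer/TopicClassifier"  # FormatClassifier works the same way
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True
).eval()

page = "http://example.com/llm-inference\n\nA post on speeding up LLM inference..."
with torch.no_grad():
    logits = model(**tok(page, return_tensors="pt", truncation=True)).logits
pred = logits.argmax(-1).item()
print(model.config.id2label[pred])  # a category label, if the config defines id2label
```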
## Plamo 2 8b
**Author:** pfnet · **License:** Other · **Downloads:** 401 · **Likes:** 19 · **Tags:** Large Language Model, Transformers, Multilingual

PLaMo 2 8B is an 8-billion-parameter hybrid-architecture language model developed by Preferred Elements, supporting English and Japanese text generation.
## Plamo 2 1b
**Author:** pfnet · **License:** Apache-2.0 · **Downloads:** 1,051 · **Likes:** 31 · **Tags:** Large Language Model, Transformers, Multilingual

PLaMo 2 1B is a 1-billion-parameter model developed by Preferred Elements, pretrained on English and Japanese datasets and featuring a hybrid architecture that combines Mamba with sliding-window attention.
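Because the hybrid Mamba/sliding-window stack ships as custom modeling code, loading PLaMo 2 through transformers requires trust_remote_code; a minimal generation sketch, where the model id is an assumption:

```python
# Generating text from a hybrid Mamba + sliding-window-attention model.
# PLaMo 2 ships custom modeling code, hence trust_remote_code=True;
# the model id below is an assumed HF id for the 1B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pfnet/plamo-2-1b"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

ids = tok("Preferred Elements develops", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=40)[0], skip_special_tokens=True))
```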
## Modernbert Large Squad2 V0.1
**Author:** Praise2112 · **License:** Apache-2.0 · **Downloads:** 19 · **Likes:** 2 · **Tags:** Question Answering System, Transformers

A question-answering model fine-tuned from ModernBERT-large on the SQuAD 2.0 dataset, supporting long-context processing.
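A minimal extractive-QA sketch using the transformers pipeline; the model id is inferred from the entry, and handle_impossible_answer covers the unanswerable questions SQuAD 2.0 introduces:

```python
# Extractive question answering with a SQuAD-2.0-style model.
# The model id is inferred from the entry. SQuAD 2.0 includes
# unanswerable questions; handle_impossible_answer lets the pipeline
# return an empty answer when the span isn't in the context.
from transformers import pipeline

qa = pipeline("question-answering", model="Praise2112/ModernBERT-large-squad2-v0.1")

result = qa(
    question="What format does llama.cpp use?",
    context="llama.cpp loads models stored in the GGUF binary format.",
    handle_impossible_answer=True,
)
print(result["answer"], result["score"])
```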
## Ichigo Llama3.1 S Instruct V0.4 GGUF
**Author:** mradermacher · **License:** Apache-2.0 · **Downloads:** 369 · **Likes:** 1 · **Tags:** Large Language Model, English

Statically quantized versions of Menlo/Ichigo-llama3.1-s-instruct-v0.4, offered in multiple quantization levels to suit different hardware.
## Deepseek V2 Lite
**Author:** ZZichen · **Downloads:** 20 · **Likes:** 1 · **Tags:** Large Language Model, Transformers

DeepSeek-V2-Lite is a cost-efficient Mixture-of-Experts (MoE) language model with 16B total parameters, of which 2.4B are active per token, supporting a 32K context length.
## Meta Llama 3 8B Instruct Function Calling Json Mode
**Author:** hiieu · **Downloads:** 188 · **Likes:** 75 · **Tags:** Large Language Model, Transformers, English

A model fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct, specifically for function calling and JSON mode.
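The general function-calling pattern this kind of fine-tune targets: advertise a tool as a JSON schema, then parse the model's JSON reply. The prompt template this particular model was trained on is an assumption; check the model card for its canonical format:

```python
# Generic function-calling flow: advertise a tool schema, parse JSON back.
# The exact prompt template this fine-tune expects is an assumption;
# consult the model card for the canonical format.
import json

tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
system = f"You may call this function by replying with JSON:\n{json.dumps(tool)}"

# ...send `system` plus the user message to the model, then parse its reply:
reply = '{"name": "get_weather", "arguments": {"city": "Abu Dhabi"}}'  # example output
call = json.loads(reply)
assert call["name"] == "get_weather"
print(call["arguments"]["city"])
```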
## Minicpm MoE 8x2B
**Author:** openbmb · **Downloads:** 6,377 · **Likes:** 41 · **Tags:** Large Language Model, Transformers

MiniCPM-MoE-8x2B is a Transformer-based Mixture-of-Experts (MoE) language model with 8 expert modules, of which each token activates 2.
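The "each token activates 2 of 8 experts" routing can be illustrated in a few lines of PyTorch. This is an illustrative top-2 router, not MiniCPM's actual implementation:

```python
# Illustrative top-2 mixture-of-experts routing (not MiniCPM's code):
# each token picks its 2 highest-scoring experts out of 8 and mixes
# their outputs with softmax-renormalized gate weights.
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.gate(x)                 # (tokens, n_experts)
        w, idx = scores.topk(self.k, dim=-1)  # top-2 experts per token
        w = w.softmax(dim=-1)                 # renormalize gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(x[mask])
        return out

moe = Top2MoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

Only 2 of the 8 expert MLPs run per token, which is why such models report far fewer "active" parameters than total parameters.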
## Decilm 6b Instruct
**Author:** Deci · **License:** Other · **Downloads:** 105 · **Likes:** 134 · **Tags:** Large Language Model, Transformers, English

DeciLM 6B-Instruct is an English-language model designed for short-form instruction following, fine-tuned from DeciLM 6B using LoRA.
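LoRA fine-tuning trains small low-rank adapters on top of a frozen base model; a generic sketch with the peft library, where the hyperparameters and target-module names are illustrative rather than Deci's actual recipe:

```python
# Generic LoRA setup with peft: freeze the base model and train only
# low-rank adapter matrices. Hyperparameters and target-module names
# are illustrative, not the recipe Deci used for DeciLM 6B-Instruct.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Deci/DeciLM-6b", trust_remote_code=True)

config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a small fraction of the 6B weights
```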